Gang migration of virtual machines using cluster-wide deduplication

ABSTRACT

Datacenter clusters often employ live virtual machine (VM) migration to efficiently utilize cluster-wide resources. Gang migration refers to the simultaneous live migration of multiple VMs from one set of physical machines to another in response to events such as load spikes and imminent failures. Gang migration generates a large volume of network traffic and can overload the core network links and switches in a data center. The present technology reduces the network overhead of gang migration using global deduplication (GMGD). GMGD identifies and eliminates the retransmission of duplicate memory pages among VMs running on multiple physical machines in the cluster. A prototype GMGD reduces the network traffic on core links by up to 51% and the total migration time of VMs by up to 39% when compared to the default migration technique in QEMU/KVM, with reduced adverse performance impact on network-bound applications.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is a Continuation of U.S. patent application Ser. No. 14/137,131, filed Dec. 20, 2013, now U.S. Pat. No. 9,372,726, issued Jun. 21, 2016, which is a non-provisional of, and claims benefit of priority under 35 U.S.C. §119(e) from, U.S. Provisional Patent Application No. 61/750,450, filed Jan. 9, 2013, the entirety of which are each expressly incorporated herein by reference.

STATEMENT OF GOVERNMENT SUPPORT

This invention was made with government support under award 0845832 awarded by the National Science Foundation. The government has certain rights in this invention.

BACKGROUND OF THE INVENTION

Live migration of virtual machines (VMs) is a critical activity in the operation of modern data centers. Live migration involves the transfer of multiple gigabytes of memory within a short duration (assuming that network attached storage is used, which does not require migration) and can consequently consume significant amounts of network and CPU resources.

An administrator may need to simultaneously migrate multiple VMs to perform resource re-allocation to handle peak workloads, imminent failures, cluster maintenance, or powering down an entire rack to save energy. Simultaneous live migration of multiple VMs is referred to as gang migration [8]. Gang migration is a network-intensive activity that can cause an adverse cluster-wide impact by overloading the core links and switches of the datacenter network. Gang migration can also affect the performance at the network edges, where the migration traffic competes with the bandwidth requirements of applications within the VMs. Hence it is important to minimize the adverse performance impact of gang migration by reducing the total amount of data transmitted due to VM migration. Reducing the VM migration traffic can also lead to a reduction in the total time required to migrate multiple VMs.

Process migration has also been extensively researched. Numerous cluster job schedulers exist, as well as virtual machine management systems, such as VMWare's DRS, XenEnterprise, Usher, Virtual Machine Management Pack, and CoD, that let administrators control jobs/VM placement based on cluster load or specific policies such as affinity or anti-affinity rules.

Wood [27] optimizes the live migration of a single VM over a wide-area network through a variant of the stop-and-copy approach which reduces the number of memory copying iterations. Zhang [30] and Wood [27] further use page-level deduplication along with the transfer of differences between dirtied and original pages, eliminating the need to retransmit the entire dirtied page. Jin [16] uses an adaptive page compression technique to optimize the live migration of a single VM. Post-copy (Hines [13]) transfers every page to the destination only once, as opposed to iterative pre-copy (Nelson [20], Clark [5]), which transfers dirtied pages multiple times. Huang [14] employs low-overhead RDMA over Infiniband to speed up the transfer of a single VM. Nocentino [21] excludes the memory pages of processes communicating over the network from being transferred during the initial rounds of migration, thus limiting the total migration time. Xia [29] shows that certain benchmarks used in high performance computing are likely to have large amounts of content sharing. That work focuses mainly on the opportunity and feasibility of exploiting content sharing, but does not provide an implementation of an actual migration mechanism using this observation, nor does it evaluate the migration time or network traffic reduction. Shrinker (Riteau [22]) migrates virtual clusters over the high-delay links of a WAN. It uses an online hashing mechanism in which hash computation for identifying duplicate pages (a CPU-intensive operation) is performed during the migration.

The following US patents and published patent applications are expressly incorporated herein in their entirety: 20130339407; 20130339390; 20130339310; 20130339300; 20130339299; 20130339298; 20130332685; 20130332660; 20130326260; 20130326159; 20130318051; 20130315260; 20130297855; 20130297854; 20130290267; 20130282662; 20130263289; 20130262801; 20130262638; 20130262615; 20130262410; 20130262396; 20130262394; 20130262392; 20130262390; 20130262386; 20130262385; 20130254402; 20130253977; 20130246366; 20130246360; 20130238575; 20130238572; 20130238563; 20130238562; 20130232215; 20130227352; 20130212437; 20130212200; 20130198459; 20130159648; 20130159645; 20130151484; 20130138705; 20130132967; 20130132531; 20130125120; 20130121209; 20130117240; 20130111262; 20130110793; 20130110779; 20130110778; 20130097380; 20130097377; 20130086353; 20130086269; 20130086006; 20130080728; 20130080408; 20130061014; 20130055249; 20130055248; 20130054932; 20130054927; 20130054910; 20130054906; 20130054890; 20130054889; 20130054888; 20130054545; 20130046949; 20130042052; 20130041872; 20130031563; 20130031331; 20130024645; 20130024424; 20120331021; 20120290950; 20120284236; 20120254119; 20120240110; 20120239871; 20120213069; 20120102455; 20120089764; 20120084595; 20120084527; 20120084507; 20120084506; 20120084505; 20120084504; 20120084270; 20120084262; 20120084261; 20120079318; 20120079190; 20120079189; 20120017114; 20120017027; 20120011176; 20110238775; 20110179415; 20110167221; 20110161723; 20110161299; 20110161297; 20110161295; 20110161291; 20110087874; 20100333116; 20100332818; 20100332658; 20100332657; 20100332479; 20100332456; 20100332454; 20100332401; 20100274772; 20100241807; 20100241726; 20100241673; 20100241654; 20100106691; 20100070725; 20100070528; 20100011368; 20090240737; 20060069717; 20060010195; 20050262194; 20050262193; 20050262192; 20050262191; 20050262190; 20050262189; 20050262188; 20050240592; 20050240354; 20050235274; 20050234969; 20050232046; 20050228808; 20050223109; 20050222931; U.S. Pat. Nos. 8,612,439; 8,601,473; 8,600,947; 8,595,460; 8,595,346; 8,595,191; 8,589,640; 8,577,918; 8,566,640; 8,554,918; 8,549,518; 8,549,350; 8,549,245; 8,533,231; 8,527,544; 8,516,158; 8,504,870; 8,504,791; 8,504,670; 8,489,744; 8,484,505; 8,484,356; 8,484,249; 8,463,991; 8,453,031; 8,452,932; 8,452,731; 8,442,955; 8,433,682; 8,429,651; 8,429,649; 8,429,360; 8,429,307; 8,413,146; 8,407,428; 8,407,190; 8,402,309; 8,402,306; 8,375,003; 8,335,902; 8,332,689; 8,311,985; 8,307,359; 8,307,177; 8,285,681; 8,239,584; 8,209,506; 8,166,265; 8,135,930; 8,060,553; 8,060,476; 8,046,550; 8,041,760; 7,814,470; and 7,814,142.

SUMMARY OF THE INVENTION

The present technology provides, for example, live gang migration of multiple VMs that run on multiple physical machines, which may be in a cluster or separated by a local area network or wide area network. A cluster is assumed to have a high-bandwidth low-delay interconnect such as Gigabit Ethernet [10], 10 GigE [9], or Infiniband [15]. Wide area networks tend to have lower throughput, lower communications reliability, and higher latency than communications within a cluster. One approach to reducing the network traffic due to gang migration uses the following observation. VMs within a cluster often have similar memory content, given that they may execute the same operating system, libraries, and applications. Hence, a significant number of their memory pages may be identical (Waldspurger [25]). Similarly, VMs communicating over less constrained networks may also share memory content.

One can reduce the network overhead of gang migration using deduplication, i.e., by avoiding the transmission of duplicate copies of identical pages. One approach is called gang migration using global deduplication (GMGD), which performs deduplication during the migration of VMs that run on different physical machines. In contrast, gang migration using local deduplication (GMLD) refers to deduplicating the migration of VMs running within a single host [8].

Various aspects which may be used include: a technique to identify and track identical memory content across VMs running on different physical machines in a cluster, including non-migrating VMs running on the target machines; and a technique to deduplicate this identical memory content during the simultaneous live migration of multiple VMs, while keeping the coordination overhead low.

For example, an implementation of GMGD may be provided on the QEMU/KVM [18] platform. A quantitative evaluation of GMGD on a 30-node cluster test bed having 10 GigE core links and 1 Gbps edge links was performed, comparing GMGD against two techniques: QEMU/KVM's default live migration technique, called online compression (OC), and GMLD.

Prior efforts to reduce the data transmitted during VM migration have focused on live migration of a single VM (Clark [5], Nelson [20], Hines [13], Jin [16]), live migration of multiple VMs running on the same physical machine (GMLD, Deshpande [8]), live migration of a virtual cluster across a wide-area network (WAN) (Riteau [22]), or non-live migration of multiple VM images across a WAN (Samer [17]). Compared to GMLD, GMGD faces the additional challenge of ensuring that the cost of global deduplication does not exceed the benefit of network traffic reduction during the live migration. The deduplication cost may be calculated, inferred or presumed. In contrast to migration over a WAN, which has high-bandwidth high-delay links, migration within a datacenter LAN has high-bandwidth low-delay links. This difference is important because hash computations, which are used to identify and deduplicate identical memory pages, are CPU-intensive operations. When migrating over a LAN, hash computations become a serious bottleneck if performed online during migration, whereas over a WAN, the large round-trip latency can mask the online hash computation overhead.

First, a distributed duplicate tracking phase identifies and tracks identical memory content across VMs running on the same or different physical machines in a cluster, including non-migrating VMs running on the target machines. The key challenge here is a distributed indexing mechanism that computes content hashes on VMs' memory content on different machines and allows individual nodes to efficiently query and locate identical pages. Two options are a distributed hash table or a centralized indexing server, both of which have their relative merits and drawbacks. The former prevents a single point of bottleneck/failure, whereas the latter simplifies the overall indexing and lookup operation during runtime.

Secondly, a distributed deduplication phase, during the migration phase, avoids the need for re-transmission of identical memory content, which was identified in the first step, during the simultaneous live migration of multiple VMs. The goal here is to reduce the network traffic generated by migration of multiple VMs by eliminating the retransmission of identical pages from different VMs. Note that the deduplication operation would itself introduce control traffic to identify which identical pages have already been transferred from the source to the target racks. This control traffic overhead is minimized in terms of both additional bandwidth and latency introduced due to synchronization.

Deduplication has been used to reduce the memory footprint of VMs in Bugnion [3], Waldspurger [25], Milos [19], Arcangeli [1], Wood [28] and Gupta [11]. These techniques use deduplication to reduce memory consumption either within a single VM or between multiple co-located VMs. In contrast, the present technology uses cluster-wide deduplication across multiple physical machines to reduce the network traffic overhead when simultaneously migrating multiple VMs. Non-live migration of a single VM can be sped up by using content hashing to detect blocks within the VM image that are already present at the destination (Sapuntzakis [23]). VM-Flock (Samer [17]) speeds up the non-live migration of a group of VM images over a high-bandwidth high-delay wide-area network by deduplicating blocks across the VM images. In contrast, one embodiment of the present technology focuses on reducing the network performance impact of the live and simultaneous migration of the memories of multiple VMs within a high-bandwidth low-delay datacenter network. The technology can of course be extended outside of these presumptions.

In the context of live migration of multiple VMs, GMLD (Deshpande [8]) deduplicates the transmission of identical memory content among VMs co-located within a single host. It also exploits sub-page level deduplication, page similarity, and delta differences for dirtied pages, all of which can be integrated in GMGD.

The large round-trip latency of WAN links masks the high hash computation overhead during migration, and therefore makes online hashing feasible. Over low-delay links, e.g., a Gigabit Ethernet LAN, offline hashing appears preferable.

Gang migration with global deduplication (GMGD) provides a solution to reduce the network load resulting from the simultaneous live migration of multiple VMs within a datacenter that has a high-bandwidth low-latency interconnect, and has implications for other environments. The technology employs cluster-wide deduplication to identify, track, and avoid the retransmission of pages that have identical content. Evaluations of a GMGD prototype on a 30-node cluster show that GMGD reduces the amount of data transferred over the core links during migration by up to 51% and the total migration time by up to 39% compared to online compression. A similar technology may be useful for sub-page-level deduplication, which advantageously would reduce the amount of data that needs to be transferred. Ethernet multicast may also be used to reduce the amount of data that needs to be transmitted.

Although GMGD is described in the context of its use within a single datacenter for clarity, GMGD can also be used for migration of multiple VMs between multiple datacenters across a wide-area network (WAN). The basic operation of GMGD over a WAN remains the same.

Compared to existing approaches that use online hashing/compression, GMGD uses an offline duplicate tracking phase. This would in fact eliminate the computational overhead of hash computation during the migration of multiple VMs over the WAN and improve the overall performance of applications that execute within the VMs.

Furthermore, as WAN link latencies reduce further, the cost of performing online hash computation (i.e., during migration) for a large number of VMs would continue to increase. This would make GMGD more attractive due to its use of an offline duplicate tracking phase.

It is therefore an object to provide a system and method for gang migration with global deduplication, comprising: providing a datacenter comprising a plurality of virtual machines in a cluster defined by a set of information residing in a first storage medium, the cluster communicating through at least one data communication network; performing cluster-wide deduplication of the plurality of virtual machines to identify redundant memory pages of the first storage medium representing the respective virtual machines that have corresponding content; initiating a simultaneous live migration of the plurality of virtual machines in the cluster, by communicating information sufficient to reconstitute the plurality of virtual machines in a cluster defined by the set of information residing in a second storage medium, through the at least one data communication network; based on the identification of the redundant memory pages having corresponding content, selectively communicating information representing the unique memory pages of the first storage medium through the at least one communication network to the second storage medium, substantially without communicating all of the memory pages of the first storage medium; and subsequent to communication through the at least one communication network, duplicating the redundant memory pages of the first storage medium in the second storage medium selectively dependent on the identified redundant memory pages, to reconstitute the plurality of virtual machines in the second storage medium.

It is also an object to provide a system for gang migration with global deduplication, in a datacenter comprising a plurality of virtual machines in a cluster defined by a set of information residing in a first storage medium, the cluster communicating through at least one data communication network, comprising: at least one processor configured to perform cluster-wide deduplication of the plurality of virtual machines to identify redundant memory pages of the first storage medium representing the respective virtual machines that have corresponding content; at least one communication link configured to communicate a simultaneous live migration of the plurality of virtual machines in the cluster, by communicating information sufficient to reconstitute the plurality of virtual machines in a cluster defined by the set of information residing in a second storage medium, through the at least one data communication network; the at least one processor being further configured, based on the identification of the redundant memory pages having corresponding content, to selectively communicate information representing the unique memory pages of the first storage medium through the at least one communication network to the second storage medium, substantially without communicating all of the memory pages of the first storage medium, and subsequently to communicate through the at least one communication network, duplicating the redundant memory pages of the first storage medium in the second storage medium selectively dependent on the identified redundant memory pages, to reconstitute the plurality of virtual machines in the second storage medium.

It is a still further object to provide a method for migration of virtual machines with global deduplication, comprising: providing a plurality of virtual machines at a local facility, defined by a set of stored information comprising redundant portions, the network being interconnected with a wide area network; identifying at least a subset of the redundant portions of the stored information; initiating a simultaneous live migration of the plurality of virtual machines by communicating through the wide area network to the remote location data sufficient to reconstitute the set of stored information, comprising the identification of the subset of the elements of the redundant portions and the set of stored information less redundant ones of the subset of the redundant portions of the stored information; receiving at a remote location the data sufficient to reconstitute the set of stored information; duplicating the subset of the redundant portions of the stored information to reconstitute the set of stored information defining the plurality of virtual machines; and transferring an active status to the reconstituted plurality of virtual machines at the remote location.

The identification of redundant portions or pages of memory is advantageously performed using a hash table, which can be supplemented with a dirty or delta table, such that the hash values need not all be recomputed in real time. A hash value of a memory portion which remains unchanged can be computed once, and so long as the portion remains unchanged, the hash value is maintained. Hash values of pages or portions which change dynamically can be recomputed as necessary.
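
By way of non-limiting illustration, the following Python sketch shows one way such incremental hash maintenance might be organized; the class and method names (PageHashTracker, mark_dirty, refresh, duplicates) are hypothetical and do not correspond to the QEMU/KVM prototype described below.

    import hashlib

    PAGE_SIZE = 4096  # assumed page size in bytes

    class PageHashTracker:
        """Keeps a per-page SHA-1 digest and recomputes it only for pages
        that have been marked dirty since the last scan."""

        def __init__(self, memory: bytearray):
            self.memory = memory
            self.num_pages = len(memory) // PAGE_SIZE
            self.hashes = [None] * self.num_pages    # cached digests
            self.dirty = set(range(self.num_pages))  # all pages dirty initially

        def mark_dirty(self, page_no: int):
            # Called from the write-protection fault path (dirty logging).
            self.dirty.add(page_no)

        def refresh(self):
            # Recompute digests only for dirtied pages; clean pages keep
            # their previously computed hash values.
            for page_no in self.dirty:
                start = page_no * PAGE_SIZE
                page = self.memory[start:start + PAGE_SIZE]
                self.hashes[page_no] = hashlib.sha1(page).hexdigest()
            self.dirty.clear()

        def duplicates(self):
            # Group page numbers by content hash; groups with more than one
            # member are candidates for deduplication.
            groups = {}
            for page_no, digest in enumerate(self.hashes):
                groups.setdefault(digest, []).append(page_no)
            return {d: pages for d, pages in groups.items() if len(pages) > 1}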

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an illustration of GMGD;

FIG. 2 shows deduplication of identical pages during migration;

FIG. 3 shows the layout of the testbed used for evaluation;

FIG. 4 illustrates network traffic on core links when migrating idle VMs;

FIG. 5 illustrates network traffic on core links when migrating busy VMs;

FIG. 6 shows a downtime comparison;

FIG. 7 shows the total migration time with background traffic;

FIG. 8 shows background traffic performance with gang migration;

FIG. 9 illustrates the proposed scatter-gather based live VM migration; and

FIG. 10 shows a block diagram of a known computer network topology.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Architecture

The high-level architecture of GMGD is shown with respect to FIG. 1.

For simplicity of exposition, we first describe how GMGD operates when VMs are live migrated from one rack of machines to another rack, followed by a description of its operation in the general case. For each VM being migrated, the target physical machine is provided as an input to GMGD. Target mapping of VMs could be provided by another VM placement algorithm that maximizes some optimization criteria, such as reducing inter-VM communication overhead (Wang [26]) or maximizing the memory sharing potential (Wood [28]). GMGD does not address the VM placement problem, nor does it make any assumptions about the lack or presence of inter-VM dependencies.

As shown in FIG. 1, a typical cluster consists of multiple racks of physical machines. Page P is identical among all four VMs at the source rack. VM1 and VM3 are being migrated to target rack 1. VM2 and VM4 are being migrated to target rack 2. One copy of P is sent to host 5, which forwards P to host 6 in target rack 1. Another copy of P is sent to host 8, which forwards P to host 9 in target rack 2. Thus identical pages headed for the same target rack are sent only once per target rack over the core network, reducing network traffic overhead.

Machines within a rack are connected to a top-of-the-rack (TOR) switch. TOR switches are connected to one or more core switches using high-bandwidth links (typically 10 Gbps or higher). GMGD does not preclude the use of other layouts where the core network could become overloaded. Migrating VMs from one rack to another increases the network traffic overhead on the core links. To reduce this overhead, GMGD employs a cluster-wide deduplication mechanism to identify and track identical pages across VMs running on different machines. As illustrated in FIG. 1, GMGD identifies the identical pages from VMs that are being migrated to the same target rack (or more generally, the same facility) and transfers only one copy of each identical page to the target rack. At the target rack, the first machine to receive the identical page transfers the page to other machines in the rack that also require the page. This prevents duplicate transfers of an identical page over the core network to the same target rack. GMGD can work with any live VM migration technique, such as pre-copy [5] or post-copy [13]. In the prototype system described below, GMGD was implemented within the default pre-copy mechanism in QEMU/KVM. GMGD has two phases, namely duplicate tracking and live migration.

Physical machines in enterprise clusters often have multiple network interface cards (NICs) to increase the network bandwidth available to each node. The availability of multiple NICs may be exploited to reduce the total migration time of live gang migration. The basic idea is that memory pages from each VM can potentially be scattered during migration to multiple nodes in the target machine's rack. The scattered pages could then be gathered by the target machine through parallel transfers over multiple NICs. At first look, this scatter-gather approach seems to introduce an additional hop in the page transfer between the source and the target. However, when the scatter-gather operation is combined with distributed deduplication across multiple VMs, the performance advantages of the approach become apparent. In essence, pages with identical content on different VMs are scattered to the same machine on the target rack. Only the first copy of the identical page needs to be transferred, whereas subsequent pages are communicated via their unique identifiers (which include the VM's ID, the target machine's ID, the page offset and a content hash).
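
For purposes of illustration only, a page identifier of the kind described above might be represented as in the following Python sketch; the field names are illustrative assumptions rather than the prototype's actual wire format.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PageIdentifier:
        """Identifies a deduplicated page without carrying its contents."""
        vm_id: int           # which VM the page belongs to
        target_host_id: int  # machine in the target rack that will hold the VM
        page_offset: int     # page frame offset within the VM's memory
        content_hash: bytes  # 160-bit SHA-1 digest of the page contents

    # Example: the first copy of a page is sent in full; later copies are
    # announced with an identifier such as this one.
    ident = PageIdentifier(vm_id=3, target_host_id=6, page_offset=0x2a7,
                           content_hash=b'\x00' * 20)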

A. Duplicate Tracking Phase

The Duplicate Tracking Phase is carried out during normal execution of VMs at the source machines, before the migration begins. Its purpose is to identify all duplicate memory content (e.g., at the page level) across all VMs residing on different machines. Content hashing is used to detect identical pages. Pages having the same content yield the same hash value. When the hashing is performed using a standard 160-bit SHA1 [12], the probability of collision is less than the probability of a memory error, or an error in a TCP connection (Chabaud [4]). Of course, different hashing or memory page identification technologies might be used. For example, in some environments, static content is mapped to memory locations, in which case the static content need only be identified, such as with a content vector. In other cases, especially where local processing capacity is available, a memory page which differs by a small amount from a reference page may be coded by its differences. Of course, other technologies which inferentially define the content of the memory can be used.

In each machine, a per-node controller process coordinates the tracking of identical pages among all VMs in the machine. The per-node controller instructs a user-level QEMU/KVM process associated with each VM to scan the VM's memory image, perform content-based hashing, and record identical pages. Since each VM is constantly executing, some of the identical pages may be modified (dirtied) by the VM, either during the hashing or after its completion. To identify these dirtied pages, the controller uses the dirty logging mode of QEMU/KVM. In this mode, all VM pages are marked as read-only in the shadow page table maintained by the hypervisor. The first write attempt to any read-only page results in a trap into the hypervisor, which marks the faulted page as dirty in its dirty bitmap and allows the write access to proceed. The QEMU/KVM process uses a hypercall to extract the dirty bitmap from KVM to identify the modified pages.

The per-rack deduplication servers maintain a hash table, which is populated by carrying out a rack-wide content hashing of the 160-bit hash values pre-computed by per-node controllers. Each hash is also associated with a list of hosts in the rack containing the corresponding pages. Before migration, all deduplication servers exchange the hash values and host lists with other deduplication servers.
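
The following Python sketch illustrates, under simplified and assumed interfaces, the kind of rack-wide registry a deduplication server might maintain; the class name RackDedupRegistry and its methods are hypothetical and omit locking and failure handling.

    class RackDedupRegistry:
        """Minimal sketch of a per-rack registry mapping a page's content
        hash to the set of hosts in the rack known to hold that page."""

        def __init__(self, rack_id: str):
            self.rack_id = rack_id
            self.hosts_by_hash = {}   # content hash -> set of host names

        def record(self, content_hash: bytes, host: str):
            self.hosts_by_hash.setdefault(content_hash, set()).add(host)

        def export(self):
            # Snapshot exchanged with the deduplication servers of other
            # racks before migration begins.
            return {h: set(hosts) for h, hosts in self.hosts_by_hash.items()}

        def merge(self, remote_snapshot):
            # Fold in hash/host information received from another rack.
            for content_hash, hosts in remote_snapshot.items():
                self.hosts_by_hash.setdefault(content_hash, set()).update(hosts)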

In some cases, data in memory is maintained even though the data structures corresponding to those memory pages are no longer in use. In order to avoid the need for migration of such data, a table may be maintained of “in use” or “available” memory pages, and the migration limited to the live data structures or program code. In many cases, operating system resources already maintain such table(s), and therefore these need not be independently created or maintained.

B. Migration Phase

In this phase, all VMs are migrated in parallel to their destination machines. The pre-computed hashing information is used to perform the deduplication of the transferred pages at both the host and the rack levels. QEMU/KVM queries the deduplication server for its rack to acquire the status of each page. If the page has not already been transferred by another VM, then its status is changed to sent and it is transferred to the target QEMU/KVM. For subsequent instances of the same page from any other VM migrating to the same rack, QEMU/KVM transfers only the page identifier. Deduplication servers also periodically exchange information about the pages marked as sent, which allows the VMs in one rack to avoid retransmission of the pages that have already been sent by the VMs from another rack.
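
A minimal Python sketch of the per-rack "sent" bookkeeping described above follows; the names (PageStatusTable, should_send_full_page) are hypothetical, and locking and batching are omitted for brevity.

    UNSENT, SENT = 0, 1

    class PageStatusTable:
        """Sketch of the per-rack 'sent' status consulted during migration."""

        def __init__(self):
            self.status = {}   # content hash -> UNSENT or SENT

        def should_send_full_page(self, content_hash: bytes) -> bool:
            # The first requester for a given content hash sends the full page
            # and flips the status to SENT; later requesters send only an
            # identifier.
            if self.status.get(content_hash, UNSENT) == UNSENT:
                self.status[content_hash] = SENT
                return True
            return False

        def sent_hashes(self):
            # Periodically shared with the deduplication servers of other
            # racks, so their VMs also avoid retransmitting these pages.
            return [h for h, s in self.status.items() if s == SENT]

        def absorb_remote_sent(self, hashes):
            for h in hashes:
                self.status[h] = SENT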

C. Target-Side VM Deduplication

The racks used as targets for VM migration are often not empty. They may host VMs containing pages that are identical to the ones being migrated into the rack. Instead of transferring such pages from the source racks via core links, they are forwarded within the target rack from the hosts running the VMs to the hosts receiving the migrating VMs. The deduplication server at the target rack monitors the pages within hosted VMs and synchronizes this information with other deduplication servers. Per-node controllers perform this forwarding of identical pages among hosts in the target rack.

D. Scatter-Gather VM Deduplication

FIG. 9 shows the potential architecture of the system for a two-rack scenario, for simplicity of exposition. The system is easily generalized for a larger multi-rack scenario. One or more VMs are migrated from source machine(s) in one rack to target machine(s) in another rack.

Machines in each rack together export a virtual memory device, which is essentially a logical device that aggregates the free memory space available on each machine in the rack. Within each node, the virtual memory device is exported via the block device interface, which is normally used to perform I/O operations. Such a virtualized memory device can be created using the MemX system (Deshpande [31], Hines [32], Hines [33]). See osnet.cs.binghamton.edu/projects/memx.html, expressly incorporated herein by reference.

At the source node, memory pages of the VMs being migrated are written to the virtual memory device, which transparently scatters the pages over the network to machines in Rack 2 and keeps track of their location using a distributed hash table. These pages are also deduplicated against identical pages belonging to other VMs. The target node then reads the pages from the virtual memory device, which transparently gathers pages from other nodes on Rack 2.
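
As a simplified illustration of this scatter-gather idea (and not a description of the MemX implementation), the following Python sketch models a rack-wide virtual memory device that deduplicates scattered pages by content hash; the class name, the placement policy, and the use of an in-process dictionary in place of a distributed hash table are all assumptions made for brevity.

    import hashlib

    class VirtualMemoryDevice:
        """Toy model of a rack-wide virtual memory device: pages written at
        the source are scattered across the rack's nodes and deduplicated by
        content hash; the target later gathers them by (vm_id, page_offset)."""

        def __init__(self, rack_nodes):
            self.rack_nodes = rack_nodes
            self.location = {}          # content hash -> node holding the page
            self.store = {n: {} for n in rack_nodes}  # node -> {hash: bytes}
            self.index = {}             # (vm_id, page_offset) -> content hash

        def scatter(self, vm_id, page_offset, page: bytes):
            digest = hashlib.sha1(page).digest()
            if digest not in self.location:
                # Place the first (and only) copy on a node chosen by hash.
                node = self.rack_nodes[digest[0] % len(self.rack_nodes)]
                self.store[node][digest] = page
                self.location[digest] = node
            self.index[(vm_id, page_offset)] = digest

        def gather(self, vm_id, page_offset) -> bytes:
            digest = self.index[(vm_id, page_offset)]
            return self.store[self.location[digest]][digest]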

Note that the scatter-gather approach can be used with both pre-copy and post-copy migration mechanisms. With pre-copy, the scatter and gather phases overlap with the iterative copy phase, enabling the latter to complete quickly, so that the source can initiate downtime earlier than it would have through traditional pre-copy. With traditional pre-copy, the source node may take a long time to initiate downtime depending upon whether the workload is read-intensive or write-intensive. With post-copy, the scatter operation allows the active-push phase to quickly eliminate residual state from the source node, and the gather phase quickly transfers the memory content to the intended target host.

The scatter and gather operation can use multiple NICs at the source and target machines to perform parallel transfer of memory pages. In addition, with the availability of multi-core machines, multiple parallel threads at each node can carry out parallel reception and processing of the VM's memory pages. These two factors, combined with cluster-wide deduplication, will enable significant speedups in simultaneous migration of multiple VMs in enterprise settings.

EXAMPLE

A prototype of GMGD was implemented in the QEMU/KVM virtualization environment. The implementation is completely transparent to the users of the VMs. With QEMU/KVM, each VM is spawned as a process on a host machine. A part of the virtual address space of the QEMU/KVM process is exported to the VM as its physical memory.

A. Per-Node Controllers

Per-node controllers are responsible for managing the deduplication of outgoing and incoming VMs. The controller component managing the outgoing VMs is called the source side, and the component managing the incoming VMs is called the target side. The controller sets up a shared memory region that is accessible only by other QEMU/KVM processes. The shared memory contains a hash table which is used for tracking identical pages. Note that the shared memory poses no security vulnerabilities because it is outside the physical memory region of the VM in the QEMU/KVM process' address space and is not accessible by the VM itself.

The source side of the per-node controller coordinates the local deduplication of memory among co-located VMs. Each QEMU/KVM process scans its VM's memory and calculates a 160-bit SHA1 hash for each page. These hash values are stored in the hash table, where they are compared against each other. A match of two hash values indicates the existence of two identical pages. Scanning is performed by a low-priority thread to minimize interference with the VMs' execution. It is noted that the hash table may be used for other purposes, and therefore can be a shared resource with other facilities.

The target side of the per-node controller receives incoming identical pages from other controllers in the rack. It also forwards the identical pages received on behalf of other machines in the rack to their respective controllers. Upon reception of an identical page, the controller copies the page into the shared memory region, so that it becomes available to incoming VMs.

B. Deduplication Server

Deduplication servers are to per-node controllers what per-node controllers are to VMs. Each rack contains a deduplication server that tracks the status of identical pages among VMs that are migrating to the same target rack and the VMs already at the target rack. Deduplication servers maintain a content hash table to store this information. Upon reception of a 160-bit hash value from the controllers, the last 32 bits of the 160-bit hash are used to find a bucket in the hash table. In the bucket, the 160-bit hash entry is compared against the other entries present. If no matching entry is found, a new entry is created.
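
A minimal Python sketch of this bucket lookup follows; the table size and the helper names (bucket_index, insert) are illustrative assumptions, while the use of the last 32 bits of the digest as the bucket index mirrors the description above.

    import hashlib

    NUM_BUCKETS = 1 << 20   # illustrative table size

    def bucket_index(digest: bytes) -> int:
        # The last 32 bits of the 160-bit SHA-1 digest select a bucket.
        return int.from_bytes(digest[-4:], "big") % NUM_BUCKETS

    table = [[] for _ in range(NUM_BUCKETS)]   # each bucket holds full digests

    def insert(digest: bytes) -> bool:
        """Returns True if the digest was already present (a duplicate page)."""
        bucket = table[bucket_index(digest)]
        if digest in bucket:        # compare full 160-bit values within bucket
            return True
        bucket.append(digest)       # no match: create a new entry
        return False

    # Example usage with the digest of one 4 KB zero-filled page.
    page = bytes(4096)
    print(insert(hashlib.sha1(page).digest()))   # False: first occurrence
    print(insert(hashlib.sha1(page).digest()))   # True: duplicate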

Each deduplication server can currently process up to 200,000 queries per second over a 1 Gbps link. This rate can potentially handle simultaneous VM migrations from up to 180 physical hosts. For context, common 19-inch racks can hold 44 servers of 1U (1 rack unit) height [24]. A certain level of scalability is built into the deduplication server by using multiple threads for query processing, fine-grained reader/writer locks, and batching of queries from VMs to reduce the frequency of communication with the deduplication server.

Finally, the deduplication server does not need to be a separate server per rack. It can potentially run as a background process within one of the machines in the rack that also runs VMs, provided that a few spare CPU cores are available for processing during migration.

C. Operations at the Source Machine

Upon initiating simultaneous migration of VMs, the controllers instruct individual QEMU/KVM processes to begin the migration. From this point onward, the QEMU/KVM processes communicate directly with the deduplication servers, without any involvement from the controllers. After commencing the migration, each QEMU/KVM process starts transmitting every page of its respective VM. For each page it checks in the local hash table whether the page has already been transferred. Each migration process periodically queries its deduplication server for the status of the next few pages it is about to transfer. The responses from the deduplication server are stored in the hash table, in order to be accessible to the other co-located VMs. If the QEMU/KVM process discovers that a page has not been transferred, then it transmits the entire page to its peer QEMU/KVM process at the target machine along with its unique identifier. QEMU/KVM at the source also retrieves from the deduplication server a list of other machines in the target rack that need an identical page. This list is also sent to the target machine's controller, which then retrieves the page and sends it to the machines in the list. Upon transfer, the page is marked as sent in the source controller's hash table. The QEMU/KVM process periodically updates its deduplication server with the status of the sent pages. The deduplication server also periodically updates other deduplication servers with a list of identical pages marked as sent. Dirty pages and unique pages that have no match with other VMs are transferred in their entirety to the destination.
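
By way of illustration, the following Python sketch outlines the source-side loop described above under an assumed deduplication-server interface; query_status, hosts_needing, report_sent, send_page, and send_identifier are hypothetical calls standing in for the prototype's actual messages, and pages is assumed to be a list of objects carrying content_hash, identifier, and data attributes.

    BATCH = 64   # number of upcoming pages whose status is queried at once

    def migrate_vm(pages, local_table, dedup_server, target):
        """Sketch of the per-VM source loop: batch status queries, then either
        send a full page (plus the list of other target-rack hosts that need
        it) or send only the page identifier."""
        for start in range(0, len(pages), BATCH):
            batch = pages[start:start + BATCH]
            # Ask the rack's deduplication server about the next few pages and
            # cache the answers locally so co-located VMs can reuse them.
            local_table.update(
                dedup_server.query_status(p.content_hash for p in batch))
            for page in batch:
                if local_table.get(page.content_hash) == "sent":
                    target.send_identifier(page.identifier)
                else:
                    needy_hosts = dedup_server.hosts_needing(page.content_hash)
                    target.send_page(page.identifier, page.data,
                                     forward_to=needy_hosts)
                    local_table[page.content_hash] = "sent"
            # Periodically report which pages this VM has marked as sent.
            dedup_server.report_sent(p.content_hash for p in batch)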

FIG. 2 shows the message exchange sequence between the deduplication servers and QEMU/KVM processes for an inter-host deduplication of page P.

D. Operations at the Target Machine

On the target machine, each QEMU/KVM process allocates a memory region for its respective VM where incoming pages are copied. Upon reception of an identical page, the target QEMU/KVM process copies it into the VM's memory and inserts it into the target hash table according to its identifier. If only an identifier is received, a page corresponding to the identifier is retrieved from the target hash table and copied into the VM's memory. Unique and dirty pages are directly copied into the VMs' memory space.

E. Remote Pages

Remote pages are deduplicated pages that were transferred by hosts other than the source host. Identifiers of such pages are accompanied by a remote flag. Such pages become available to the waiting hosts in the target rack only after the carrying host forwards them. Therefore, instead of searching for such remote pages in the target hash table immediately upon reception of an identifier, the identifier and the address of the page are inserted into a per-host waiting list. A per-QEMU/KVM-process thread, called a remote thread, periodically traverses the list and checks, for each entry, whether the page corresponding to the identifier has been added into the target shared memory. The received pages are copied into the memory of the respective VMs after removing the entry from the list. Upon reception of a more recent dirtied copy of a page whose entry happens to be on the waiting list, the corresponding entry is removed from the list to prevent the thread from over-writing the page with its stale copy. The identical pages already present at the target rack before the migration are also treated as remote pages. The per-node controllers in the target rack forward such pages to the listed target hosts. This avoids their transmission over the core network links from the source racks. However, pages dirtied by VMs running in the target rack are not forwarded to other hosts; instead, they are requested by the corresponding hosts from their respective source hosts.
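
The following Python sketch illustrates one possible form of the per-host waiting list and the remote thread's periodic pass; the class RemotePageWaitList and its methods are hypothetical simplifications of the mechanism described above.

    import threading

    class RemotePageWaitList:
        """Sketch of the per-host waiting list for 'remote' deduplicated
        pages, i.e., pages that another host in the target rack will forward
        later."""

        def __init__(self, shared_pages, vm_memory):
            self.shared_pages = shared_pages   # dict: identifier -> page bytes
            self.vm_memory = vm_memory         # dict: address -> page bytes
            self.waiting = {}                  # identifier -> destination address
            self.lock = threading.Lock()

        def expect(self, identifier, address):
            with self.lock:
                self.waiting[identifier] = address

        def cancel(self, identifier):
            # Called when a newer dirtied copy of the page arrives directly,
            # so the remote thread does not overwrite it with a stale copy.
            with self.lock:
                self.waiting.pop(identifier, None)

        def poll_once(self):
            # The remote thread runs this periodically.
            with self.lock:
                ready = [i for i in self.waiting if i in self.shared_pages]
                for identifier in ready:
                    address = self.waiting.pop(identifier)
                    self.vm_memory[address] = self.shared_pages[identifier]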

F. Downtime Synchronization

Initiating a VM's downtime before completing target-to-target transfers can increase its downtime duration. However, in the default QEMU/KVM migration technique, downtime is started at the source's discretion, and the decision is made solely on the basis of the number of pages remaining to be transferred and the perceived link bandwidth at the source. Therefore, to avoid the overlap between the downtime and target-to-target transfers, a synchronization mechanism is implemented between the source and the target QEMU/KVM processes. The source QEMU/KVM process is prevented from starting the VM downtime and is kept in the live pre-copy iteration mode until all of its pages have been retrieved at the target and copied into memory. Once all remote pages are in place, the source is instructed by the target to initiate the downtime. This allows VMs to minimize their downtime, as only the remaining dirty pages at the source are transferred during the downtime.
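
A minimal sketch of this source/target handshake follows, using a Python threading event as a stand-in for the actual message exchange between the QEMU/KVM processes; the DowntimeGate name and interface are assumptions.

    import threading

    class DowntimeGate:
        """Sketch of the source/target handshake: the source stays in the
        live pre-copy iteration until the target reports that every remote
        page has been gathered and copied into memory."""

        def __init__(self):
            self._remote_pages_in_place = threading.Event()

        # Target side: called once all remote pages have been copied in.
        def target_reports_ready(self):
            self._remote_pages_in_place.set()

        # Source side: consulted before each attempt to enter the downtime.
        def may_start_downtime(self) -> bool:
            return self._remote_pages_in_place.is_set()

    gate = DowntimeGate()
    assert not gate.may_start_downtime()   # keep iterating pre-copy
    gate.target_reports_ready()
    assert gate.may_start_downtime()       # the stop-and-copy round may begin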

G. Desynchronizing Page Transfers

An optimization was implemented to improve the efficiency of deduplication. There is a small time lag between the transfer of an identical page by a VM and the status of the page being reflected at the deduplication server. This lag can result in duplicate transfer of some identical pages if two largely identical VMs start migration at the same time and transfer their respective memory pages in the same order of page offsets. To reduce such duplicate transfers, each VM transfers pages in a different order depending upon its assigned VM number, so as to break any synchronization with other VMs. This reduces the likelihood that identical pages from different VMs may be transferred around the same time.
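
For illustration, one simple way to derive such a per-VM page transfer order is sketched below in Python; the particular start offset and stride are arbitrary assumptions (the prototype's exact ordering rule may differ), and the permutation property of this sketch holds when the page count is a power of two.

    def transfer_order(num_pages: int, vm_number: int) -> list:
        """Sketch: each VM walks its pages starting at a different offset and
        with a different stride derived from its VM number, so largely
        identical VMs do not submit the same page offsets at the same time."""
        stride = 2 * vm_number + 1          # odd stride: co-prime with powers of two
        start = (vm_number * 8191) % max(num_pages, 1)
        return [(start + i * stride) % num_pages for i in range(num_pages)]

    # Two identical 8-page VMs now visit their pages in different orders.
    print(transfer_order(8, 0))   # [0, 1, 2, 3, 4, 5, 6, 7]
    print(transfer_order(8, 1))   # [7, 2, 5, 0, 3, 6, 1, 4]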

Evaluation

GMGD was evaluated in a 30-node cluster testbed having high-bandwidth low-latency Gigabit Ethernet. Each physical host has two quad-core 2 GHz CPUs, 16 GB of memory, and a 1 Gbps network card. FIG. 3 shows the layout of the cluster testbed, consisting of three racks, each connected to a different top-of-rack (TOR) Ethernet switch. The TOR switches are connected to each other by a 10 GigE optical link, which acts as the core link. Although we had only the 30-node three-rack topology available for evaluation, GMGD can be used on larger topologies. Live migration of all VMs is initiated simultaneously, and memory pages from the source hosts traverse the 10 GigE optical link between the switches to reach the target hosts. For most of the experiments, each machine hosts four VMs and each VM has 2 virtual CPUs (VCPUs) and 1 GB memory. We compare GMGD against the following VM migration techniques.

(1) Online Compression (OC):

This is the default VM migration technique used by QEMU/KVM. Before transmission, it compresses pages that are filled with uniform content (primarily pages filled with zeros) by representing the entire page with just one byte. At the target, such pages are reconstructed by filling an entire page with the same byte. Other pages are transmitted in their entirety to the destination.

(2) Gang Migration with Local Deduplication (GMLD):

This technique uses content hashing to deduplicate the pages across VMs co-located on the same host (Deshpande [8]). Only one copy of identical pages is transferred from the source host.

In initial implementations of the GMGD prototype, the use of online hashing was considered, in which hash computation and deduplication are performed during migration (as opposed to before migration). Hash computation is a CPU-intensive operation. In the evaluations, it was found that the online hashing variant performed very poorly, in terms of total migration time, on high-bandwidth low-delay Gigabit Ethernet. For example, online hashing takes 7.3 seconds to migrate a 1 GB VM and 18.9 seconds to migrate a 4 GB VM, whereas offline hashing takes only 3.5 seconds and 4.5 seconds, respectively. CPU-heavy online hash computation became a serious performance bottleneck and, in fact, yielded worse total migration times than even the simple OC technique described above. Given that the total migration time of the online hashing variant is considerably worse than that of offline hashing, while the savings in network traffic are merely comparable, the results for online hashing are omitted in the reports of experiments below.

A. Network Load Reduction

1) Idle VMs: Here an equal number of VMs are migrated from each of the two source racks, i.e., for the 12×4 configuration, 4 VMs are migrated from each of the 6 hosts on each source rack. FIG. 4 shows the amount of data transferred over the core links for the three VM migration schemes with an increasing number of hosts, each host running four 1 GB idle VMs. Since OC only optimizes the transfer of uniform pages, a set that mainly consists of zero pages, it transfers the highest amount of data. GMLD also deduplicates zero pages in addition to the identical pages across the co-located VMs. As a result, it sends less data than OC. GMGD transfers the lowest amount of data. For 12 hosts, GMGD shows more than a 51% and 19% decrease in the data transferred through the core links over OC and GMLD, respectively.

2) Busy VMs: To evaluate the effect of busy VMs on the amount of data transferred during their migration, Dbench [6], a filesystem benchmark, was run inside the VMs. Dbench performs file I/O on a network attached storage. It provides an adversarial workload for GMGD because it uses the network interface for communication and DRAM as a buffer. Dbench was modified to write random data, hence its memory footprint consisted of unique pages that cannot be deduplicated. Also, the execution of Dbench was initiated after the deduplication phase of GMGD to ensure that the memory consumed by Dbench was not deduplicated. The VMs are migrated while execution of Dbench is in progress. FIG. 5 shows that GMGD yields a 48% reduction in the amount of data transferred over OC and an 18% reduction over GMLD.

B. Total Migration Time

1) Idle VMs: To measure the total migration time of the different migration techniques, the end-to-end (E2E) total migration time is measured, i.e., the time taken from the start of the migration of the first VM to the end of the migration of the last VM. Cluster administrators are concerned with the E2E total migration time of groups of VMs since it measures the time for which the migration traffic occupies the core links. The idle VM section of Table I shows the total migration time for each migration technique with an increasing number of hosts containing idle VMs. Note that even with the maximum number of hosts (i.e., 12, with 6 from each source rack), the core optical link remains unsaturated. Therefore, for each migration technique a nearly constant total migration time is observed, irrespective of the number of hosts. Further, among all three techniques, OC has the highest total migration time for any number of hosts, which is proportional to the amount of data it transfers. GMGD's total migration time is slightly higher than that of GMLD, approximately 4% higher for 12 hosts.

The difference between the total migration times of GMGD and GMLD can be attributed to the overhead associated with GMGD for performing deduplication across the hosts. While the migration is in progress, it queries the deduplication server to read or update the status of deduplicated pages. Such requests need to be sent frequently to perform effective deduplication.

2) Busy VMs: Table I shows that Dbench equally increases the total migration time of all the VM migration techniques as compared to their total migration time with idle VMs. However, a slight reduction in the total migration time is observed with an increasing number of hosts. With a lower number of hosts (and therefore a lower number of VMs), the incoming 1 Gbps Ethernet link to the network attached storage server might remain unsaturated, and therefore each Dbench instance can perform I/O at a faster rate compared to a scenario with more VMs, where the VMs must contend for the available bandwidth. The faster I/O rate results in a higher page dirtying rate, resulting in more data being transferred during the VMs' migration.

C. Downtime

FIG. 6 shows that increasing the number of hosts does not have a significant impact on the downtimes for all three schemes. This is because each VM's downtime is initiated independently of other VMs. However, the downtime for OC is slightly higher, in the range of 250 ms to 280 ms.

D. Background Traffic

With the three-rack testbed used in the above experiments, the core links remain uncongested due to the limited number of hosts in each source rack. To evaluate the effect of congestion at the core links, for the remaining experiments a 2-rack topology was used, consisting of one source rack and one target rack, each containing 10 hosts. With this layout, migration of VMs from 10 source hosts is able to saturate the core link between the TOR switches.

The effect of background network traffic on the different migration techniques was investigated. Conversely, the effect of the different migration techniques on other network-bound applications in the cluster was compared. For this experiment, the 10 GigE core link between the switches was saturated with VM migration traffic and background network traffic. 7 Gbps of background Netperf [2] UDP traffic was transmitted from the source rack to the target rack such that it competes with the VM migration traffic on the core link.

FIG. 7 shows the comparison of total migration time with UDP background traffic for the aforementioned setup. With an increasing number of VMs and hosts, the network contention and packet loss on the 10 GigE core link also increase. A larger total migration time for all three techniques was observed as compared to the corresponding idle VM migration times listed in Table I. However, observe that GMGD has a lower total migration time than both OC and GMLD, in contrast to Table I, where GMGD had a higher total migration time compared to GMLD. This is because, in response to packet loss at the core link, all VM migration sessions (which are TCP flows) back off. However, the backoff is proportional to the amount of data transmitted by each VM migration technique. Since GMGD transfers less data, it suffers less from TCP backoff due to network congestion and completes the migration faster. FIG. 8 shows the converse effect, namely, the impact of VM migration on the performance of Netperf. With an increasing number of migrating VMs, Netperf UDP packet losses increase due to network contention. For 10 hosts, GMGD receives 13% more packets than OC and 5.7% more UDP packets than GMLD.

E. Application Degradation

Table II compares the degradation of applications running inside the VMs during migration using the 10×4 configuration.

NFS I/O Benchmark: VM images are often stored on network attached storage, which can be located outside the rack hosting the VMs. Any I/O operations from the VMs traverse one or more switches before reaching the storage server. Here the impact of migration on the performance of I/O operations from VMs in the above scenario is evaluated. Two NFS servers are hosted on two machines located outside the source rack, each connected to the switch with a 1 Gbps Ethernet link. Each VM mounts a partition from one of the NFS servers, and runs a 75 MB sequential file write benchmark. The migration of VMs is carried out while the benchmark is in progress, and the effect of migration on the performance of the benchmark is observed. Since, at the source network interface, the NFS traffic interferes with the migration traffic, the benchmark shows degradation proportional to the amount of data the migration technique transfers. Table II shows the NFS write bandwidth per VM. GMGD yields the smallest reduction in observed bandwidth among the three.

TCP RR: The Netperf TCP RR workload was used to analyze the effect of VM migration on inter-VM communication. TCP RR is a synchronous TCP request-response test. 20 VMs from 5 hosts are used as senders, and 20 VMs from the other 5 hosts as receivers. The VMs are migrated while the test is in progress, and the performance of TCP RR is measured. The figures in Table II show the average transaction rate per sender VM. Due to the lower amount of data transferred through the source NICs, GMGD keeps the NICs available for the inter-VM TCP RR traffic. Consequently, it least affects the performance of TCP RR and gives the highest number of transactions per second among the three.

Sum of Subsets is a CPU-intensive workload that, given a set of integers and an integer k, finds a non-empty subset that sums to k. This program is run in the VMs during their migration to measure the average per-VM completion time of the program. Although GMGD again shows the least adverse impact on the completion time, the difference is insignificant due to the CPU-intensive nature of the workload.

Performance Overheads

Duplicate Tracking: Low-priority threads perform hash computation and dirty-page logging in the background. With 4 VMs and 8 cores per machine, a CPU-intensive workload (sum of subsets) experienced a 0.34% overhead and a write-intensive workload (random writes to memory) experienced a 1.99% overhead. With 8 VMs per machine, the overheads were 5.85% and 3.93%, respectively, primarily due to CPU contention.

Worst-case workload: GMGD does not introduce any additional overheads, compared against OC and GMLD, when running worst-case workloads. The VMs run a write-intensive workload that reduces the likelihood of deduplication by modifying 1.7 times as much data as the size of each VM. All three techniques show no discernible performance difference in terms of total migration time, data transferred, and application degradation.

Space overhead: In the worst case, when all pages are unique, the space overhead for storing the deduplication data structures in each host is 4.3% of the total memory of all VMs.

Hardware Overview

FIG. 10 (see U.S. Pat. No. 7,702,660, issued to Chan, expressly incorporated herein by reference) shows a block diagram that illustrates a computer system 400. Computer system 400 includes a bus 402 or other communication mechanism for communicating information, and a processor 404 coupled with bus 402 for processing information. The processor may be a multicore processor, and the computer system may be duplicated as a cluster of processors or computing systems. Computer system 400 also includes a main memory 406, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk or optical disk, is provided and coupled to bus 402 for storing information and instructions. The computer system 400 may host a plurality of virtual machines (VMs), which each act as a complete and self-contained computing environment for the software and user interaction, while sharing physical resources.

Computer system 400 may be coupled via bus 402 to a display 412, such as a liquid crystal display monitor, for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. In a server environment, typically the user interface for an administrator is provided remotely through a virtual terminal technology, though the information from the physical communications ports can also be communicated remotely.

The techniques described herein may be implemented through the use of computer system 400, which will be replicated for the source and destination cluster, and each computer system 400 will generally have a plurality of server “blades”. According to one embodiment of the invention, those techniques are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another machine-readable medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “machine-readable medium” as used herein refers to any medium that participates in providing data that causes a machine to operate in a specific fashion, and is tangible and non-transitory. In an embodiment implemented using computer system 400, various machine-readable media are involved, for example, in providing instructions to processor 404 for execution. Such a medium may take many forms, including but not limited to non-volatile media and volatile media, which may be local or communicate through a transmission medium or network system. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Common forms of machine-readable media include, for example, a hard disk or any other magnetic medium, a DVD or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

Various forms of machine-readable media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state storage media of a remote computer. The remote computer can load the instructions into its dynamic memory. Bus 402 carries the data to main memory 406, from which processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410.

Computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to a network link 420 that is connected to a local network 422. For example, communication interface 418 may be a 10 Gigabit Ethernet port that provides a data communication connection to a switch or router. The Ethernet packets, which may be jumbo packets (e.g., 8 KB), can be routed locally within a data center using TCP/IP, or in some cases UDP or other protocols, or externally from a data center, typically using TCP/IP protocols. Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
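By way of a non-authoritative illustration of moving migration traffic over such a link, the minimal Python sketch below streams deduplicated page payloads to a destination host over a TCP connection with simple length-prefixed framing. The port number, the framing format, and the send-buffer hint are hypothetical assumptions for this sketch and are not the wire protocol used by QEMU/KVM.

```python
import socket
import struct

MIGRATION_PORT = 18765  # hypothetical port for migration traffic


def send_pages(dest_host, pages):
    """Stream (page_number, payload) records to the destination over TCP.

    Each record is sent as an 8-byte page number and a 4-byte length,
    followed by the page payload; this framing is purely illustrative.
    """
    with socket.create_connection((dest_host, MIGRATION_PORT)) as sock:
        # A larger send buffer helps keep a 10 GbE link busy; the kernel
        # may clamp this value, so treat it as a hint.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
        for page_number, payload in pages:
            header = struct.pack("!QI", page_number, len(payload))
            sock.sendall(header + payload)
```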

Network link 420 typically provides data communication through one or more networks to other data devices. For example, network link 420 may provide a connection through local network 422 to a host computer 424, to data equipment operated by an Internet Service Provider (ISP) 426, or to an Internet 428 backbone communication link. In the case where an ISP 426 is present, the ISP 426 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the Internet 428. Local network 422 and Internet 428 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 420 and through communication interface 418, which carry the digital data to and from computer system 400, are exemplary forms of carrier waves transporting the information.

Computer system 400 can send messages and receive data, including program code, through the network(s), network link 420 and communication interface 418. In the Internet example, a server 430 might transmit a requested code for an application program through Internet 428, ISP 426, local network 422 and communication interface 418.

The information received is stored in a buffer memory and may be communicated to the processor 404 as it is received, and/or stored in storage device 410 or other non-volatile storage.

U.S. 2012/0173732, expressly incorporated herein by reference, discloses various embodiments of computer systems, the elements of which may be combined or subcombined according to the various permutations.

It is understood that this broad invention is not limited to the embodiments discussed herein, but rather is composed of the various combinations, subcombinations and permutations of the elements disclosed herein, including aspects disclosed within the incorporated references. The invention is limited only by the following claims.

REFERENCES

Each of the following references is expressly incorporated herein by reference in its entirety.

[1] A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using KSM. In Proc. of Linux Symposium, July 2009.

[2] Netperf: Network Performance Benchmark. www.netperf.org/netperf.

[3] Edouard Bugnion, Scott Devine, and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In ACM Transactions on Computer Systems, October 1997.

[4] F. Chabaud and A. Joux. Differential collisions in SHA-0. In Proc. of Annual International Cryptology Conference, August 1998.

[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proc. of Networked Systems Design and Implementation, May 2005.

[6] Dbench. samba.org/ftp/tridge/dbench.

[7] U. Deshpande, U. Kulkarni, and K. Gopalan. Inter-rack live migration of multiple virtual machines. In Proc. of Workshop on Virtualization Technologies in Distributed Computing (to appear), June 2012.

[8] U. Deshpande, X. Wang, and K. Gopalan. Live gang migration of virtual machines. In High Performance Distributed Computing, June 2010.

[9] 10-Gigabit Ethernet. en.wikipedia.org/wiki/10 gigabit ethernet.

[10] Gigabit Ethernet. en.wikipedia.org/wiki/gigabit ethernet.

[11] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: Harnessing memory redundancy in virtual machines. In Proc. of Operating Systems Design and Implementation, December 2010.

[12] OpenSSL SHA1 hash. www.openssl.org/docs/crypto/sha.html.

[13] M. Hines, U. Deshpande, and K. Gopalan. Post-copy live migration of virtual machines. Operating Systems Review, 43(3):14-26, July 2009.

[14] W. Huang, Q. Gao, J. Liu, and D. K. Panda. High performance virtual machine migration with RDMA over modern interconnects. In Proc. of IEEE International Conference on Cluster Computing, 2007.

[15] Infiniband. en.wikipedia.org/wiki/infiniband.

[16] H. Jin, L. Deng, S. Wu, X. Shi, and X. Pan. Live virtual machine migration with adaptive memory compression. In Proc. of Cluster Computing and Workshops, August 2009.

[17] Samer Al Kiswany, Dinesh Subhraveti, Prasenjit Sarkar, and Matei Ripeanu. VMFlock: Virtual machine co-migration for the cloud. In Proc. of High Performance Distributed Computing, June 2011.

[18] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. KVM: the Linux virtual machine monitor. In Proc. of Linux Symposium, June 2007.

[19] G. Milos, D. G. Murray, S. Hand, and M. A. Fetterman. Satori: Enlightened page sharing. In USENIX Annual Technical Conference, 2009.

[20] M. Nelson, B.-H. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In USENIX Annual Technical Conference, April 2005.

[21] A. Nocentino and P. M. Ruth. Toward dependency-aware live virtual machine migration. In Proc. of Virtualization Technologies in Distributed Computing, June 2009.

[22] P. Riteau, C. Morin, and T. Priol. Shrinker: Improving live migration of virtual clusters over WANs with distributed data deduplication and content-based addressing. In Proc. of Euro-Par, September 2011.

[23] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proc. of Operating Systems Design and Implementation, December 2002.

[24] Rack Unit. en.wikipedia.org/wiki/rack unit.

[25] C. A. Waldspurger. Memory resource management in VMware ESX server. In Operating Systems Design and Implementation, December 2002.

[26] J. Wang, K. L. Wright, and K. Gopalan. XenLoop: a transparent high performance inter-VM network loopback. In Proc. of High Performance Distributed Computing, June 2008.

[27] T. Wood, K. K. Ramakrishnan, P. Shenoy, and J. Van Der Merwe. CloudNet: dynamic pooling of cloud resources by live WAN migration of virtual machines. In Virtual Execution Environments, March 2011.

[28] T. Wood, G. Tarasuk-Levin, P. Shenoy, P. Desnoyers, E. Cecchet, and M. D. Corner. Memory buddies: exploiting page sharing for smart colocation in virtualized data centers. In Proc. of Virtual Execution Environments, March 2009.

[29] L. Xia and P. A. Dinda. A case for tracking and exploiting inter-node and intra-node memory content sharing in virtualized large-scale parallel systems. In Proceedings of the 6th International Workshop on Virtualization Technologies in Distributed Computing (VTDC), pages 11-18. ACM, 2012.

[30] X. Zhang, Z. Huo, J. Ma, and D. Meng. Exploiting data deduplication to accelerate live virtual machine migration. In Proc. of International Conference on Cluster Computing, September 2010.

[31] Umesh Deshpande, Beilan Wang, Shafee Hague, Michael Hines, and Kartik Gopalan. MemX: Virtualization of Cluster-wide Memory. In Proc. of 39th International Conference on Parallel Processing (ICPP), San Diego, Calif., USA, September 2010.

[32] Michael Hines and Kartik Gopalan. MemX: Supporting Large Memory Workloads in Xen Virtual Machines. In Proc. of the International Workshop on Virtualization Technology in Distributed Computing (VTDC), Reno, Nev., November 2007.

[33] Michael Hines, Jian Wang, and Kartik Gopalan. Distributed Anemone: Transparent Low-Latency Access to Remote Memory in Commodity Clusters. In Proc. of the International Conference on High Performance Computing (HiPC), December 2006.

What is claimed is:
1. A memory management system, comprising: a first plurality of virtual machines in a first cluster, having a first controller, defined by a first set of information residing in a first storage medium; a first hash table of the memory pages of the first storage medium generated by the first controller; a communication port configured to communicate through at least one data communication network with a second cluster, having a second controller, defined by a second set of information residing in a second storage medium, and having a second hash table of the memory pages of the second storage medium generated by the second controller; and at least one automated processor, configured to: identify first redundant memory pages of the first storage medium representing the respective virtual machines of the first cluster that have identical memory page content based on at least the first hash table; track memory pages in the first storage medium that have changed content with respect to the first hash table; and control a periodic exchange of the first and second hash tables between the first cluster and the second cluster through the at least one communication network.
2. The memory management system according to claim 1, wherein the at least one automated processor is further configured to deduplicate the first storage medium by eliminating the first redundant memory pages that have not been tracked to have changed content with respect to the first hash table.
3. The memory management system according to claim 1, wherein the at least one automated processor is further configured to migrate a state of the first plurality of virtual machines in a first cluster to a second plurality of virtual machines in the second cluster, by communicating only unique memory pages of the first storage medium based on at least the identified first redundant memory pages and the tracked memory pages.
4. The memory management system according to claim 1, wherein the at least one automated processor is further configured to control a simultaneous live migration of the first plurality of virtual machines, by communicating information to reconstitute the first plurality of virtual machines as a second plurality of virtual machines in the second cluster.
5. The memory management system according to claim 1, further comprising a hypervisor implemented by the at least one automated processor to track the memory pages in the first storage medium that have changed content with respect to the first hash table, the hypervisor being configured to: control a respective virtual machine; mark memory pages of a respective virtual machine as write-only after hashing; maintain a shadow page table; trap a first write attempt to a respective memory page; update the shadow page table after the first write attempt to reflect the change; and permit the write attempt to proceed.
6. The memory management system according to claim 1, further comprising a third hash table received through the communication port from a third cluster, wherein the at least one processor is further configured to communicate at least one memory page through the communication port selectively dependent on at least a content of the third hash table.
7. The memory management system according to claim 1, wherein the first controller maintains a page status for each memory page, and updates a page status for a respective memory page when the memory page is sent from the first cluster through the communication port.
8. The memory management system according to claim 1, wherein the first controller resides within a respective virtual machine on the first cluster.
9. The memory management system according to claim 8, wherein the first controller has a memory space in a shared memory region which is inaccessible to the first plurality of virtual machines of the first cluster, and the first controller has access to the first set of information residing in a first storage medium.
10. A method for managing memory, comprising: providing a first plurality of virtual machines in a first cluster, having a first controller, defined by a first set of information residing in a first storage medium; generating a first hash table of the memory pages of the first storage medium by the first controller; communicating through at least one data communication network with a second cluster, having a second controller, defined by a second set of information residing in a second storage medium, and having a second hash table of the memory pages of the second storage medium generated by the second controller; identifying first redundant memory pages of the first storage medium representing the respective virtual machines of the first cluster that have identical memory page content based on at least the first hash table; tracking memory pages in the first storage medium that have changed content with respect to the first hash table; and controlling a periodic exchange of the first and second hash tables between the first cluster and the second cluster through the at least one communication network.
11. The method according to claim 10, further comprising deduplicating the first storage medium by eliminating the first redundant memory pages that have not been tracked to have changed content with respect to the first hash table.
12. The method according to claim 10, further comprising migrating a state of the first plurality of virtual machines in a first cluster to a second plurality of virtual machines in the second cluster, by communicating only unique memory pages of the first storage medium based on at least the identified first redundant memory pages and the tracked memory pages.
13. The method according to claim 10, further comprising controlling a simultaneous live migration of the first plurality of virtual machines, by communicating information to reconstitute the first plurality of virtual machines as a second plurality of virtual machines in the second cluster.
14. The method according to claim 10, further comprising providing a hypervisor for tracking the memory pages in the first storage medium that have changed content with respect to the first hash table, said tracking comprising: controlling a respective virtual machine; marking memory pages of a respective virtual machine as write-only after hashing; maintaining a shadow page table; trapping a first write attempt to a respective memory page; updating the shadow page table after the first write attempt to reflect the change; and permitting the write attempt to proceed.
15. The method according to claim 10, further comprising providing a third hash table received through the communication network from a third cluster, and communicating at least one memory page through the communication network selectively dependent on at least a content of the third hash table.
16. The method according to claim 10, further comprising maintaining a page status for each memory page by the first controller, and updating a page status for a respective memory page when the memory page is sent from the first cluster through the communication network.
17. The method according to claim 10, further comprising implementing the first controller within a respective virtual machine on the first cluster.
18. The method according to claim 17, wherein the first controller has a memory space in a shared memory region which is inaccessible to the first plurality of virtual machines of the first cluster, further comprising accessing, by the first controller, the first set of information residing in a first storage medium.
19. A method for memory management, comprising: providing a plurality of virtual machines having an active status for processing workload, defined by a set of stored information comprising redundant portions; generating a hash table comprising respective hashes of respective pages of the plurality of virtual machines; identifying at least a subset of the redundant portions of the stored information based on at least the hash table; maintaining a list of respective memory pages that have changed content after the respective hash is generated; and periodically transmitting the generated hash table, and periodically receiving a remote hash table through a communication network.
20. The method according to claim 19, further comprising reconstituting the plurality of virtual machines by: initiating a simultaneous migration of the plurality of virtual machines by communicating information for reconstituting the plurality of virtual machines at the remote location from information comprising portions of the set of stored information that are not identified as the redundant portions, and portions of the stored information whose content has changed after the respective hash is generated; and transferring the active status from the plurality of virtual machines to the reconstituted plurality of virtual machines, such that the reconstituted plurality of virtual machines at the remote location assume processing of the workload from the plurality of virtual machines at the local facility.
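As a non-authoritative illustration of the controller-side bookkeeping recited in claims 1 through 4 and 10 through 13, the following minimal Python sketch hashes each virtual machine's memory pages into a hash table, identifies redundant pages across virtual machines, tracks dirtied pages, exchanges hash tables with a peer cluster, and selects only the unique pages that must actually be transferred. The class and method names, the use of SHA-1, and the data structures are assumptions made for this sketch and do not appear in the original disclosure.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # bytes per memory page (typical x86 page size)


class ClusterController:
    """Per-cluster controller: hashes VM pages, finds duplicates across VMs,
    tracks dirtied pages, and exchanges its hash table with a peer cluster."""

    def __init__(self):
        # digest -> list of (vm_id, page_number) sharing identical content
        self.hash_table = defaultdict(list)
        self.dirty = set()          # (vm_id, page_number) changed since hashing
        self.remote_hashes = set()  # digests received from the peer cluster

    def hash_pages(self, vm_id, memory):
        """Build hash table entries for one VM's memory image (a bytes object)."""
        for offset in range(0, len(memory), PAGE_SIZE):
            page = memory[offset:offset + PAGE_SIZE]
            digest = hashlib.sha1(page).digest()
            self.hash_table[digest].append((vm_id, offset // PAGE_SIZE))

    def mark_dirty(self, vm_id, page_number):
        """Called by the write-fault handler when a hashed page is modified."""
        self.dirty.add((vm_id, page_number))

    def redundant_pages(self):
        """Yield groups of unchanged pages, across VMs, with identical content."""
        for digest, owners in self.hash_table.items():
            clean = [o for o in owners if o not in self.dirty]
            if len(clean) > 1:
                yield digest, clean

    def exchange_hash_tables(self, peer):
        """Periodic exchange of hash tables with the peer cluster's controller."""
        self.remote_hashes = set(peer.hash_table.keys())
        peer.remote_hashes = set(self.hash_table.keys())

    def pages_to_send(self):
        """Unique pages to transfer: every dirtied page, plus one representative
        per duplicate group whose content is not already at the destination."""
        sent_hashes = set()
        for digest, owners in self.hash_table.items():
            for owner in owners:
                if owner in self.dirty:
                    yield owner                  # dirtied: resend current content
                elif digest in self.remote_hashes:
                    continue                     # content already at destination
                elif digest not in sent_hashes:
                    sent_hashes.add(digest)
                    yield owner                  # first copy of this content


# Example: hash two small VMs, dirty one page, then enumerate what to send.
ctrl = ClusterController()
ctrl.hash_pages("vm0", bytes(8 * PAGE_SIZE))
ctrl.hash_pages("vm1", bytes(8 * PAGE_SIZE))
ctrl.mark_dirty("vm0", 3)
print(list(ctrl.pages_to_send()))  # one representative zero page plus the dirty page
```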
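Similarly, a simplified user-level model of the dirty-page tracking of claims 5 and 14 might look like the sketch below: a page is protected after hashing, the first write to it is trapped, the shadow page table and the controller's dirty list are updated, and the write is then allowed to proceed. A real hypervisor would enforce this through page-table protection bits and hardware faults rather than a Python method; the names here are illustrative, and the controller can be any object providing mark_dirty(vm_id, page_number), such as the ClusterController sketched above.

```python
PAGE_SIZE = 4096  # bytes per memory page


class ShadowPagedVM:
    """User-level model of a VM whose pages are protected after hashing so
    that the first write to each page is trapped and recorded as dirty."""

    def __init__(self, num_pages, controller, vm_id):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.protected = [False] * num_pages  # stands in for page-table protection bits
        self.shadow = {}                      # shadow page table: page number -> status
        self.controller = controller          # must provide mark_dirty(vm_id, page_number)
        self.vm_id = vm_id

    def protect_after_hashing(self, page_number):
        """Called once a page's hash has been recorded in the hash table."""
        self.protected[page_number] = True
        self.shadow[page_number] = "clean"

    def write(self, page_number, offset, data):
        """Guest write path: trap the first write to a protected page, record
        the change, then let the write proceed."""
        if self.protected[page_number]:
            self.controller.mark_dirty(self.vm_id, page_number)
            self.shadow[page_number] = "dirty"
            self.protected[page_number] = False
        self.pages[page_number][offset:offset + len(data)] = data
```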